KNACK-2002: a Richly Annotated Corpus of Dutch Written Text

Authors

  • Véronique Hoste
  • Guy De Pauw
Abstract

In this paper, we introduce the annotated KNACK-2002 corpus of Dutch written text. The corpus features five annotation layers: morphological boundaries at the word level, part-of-speech tags and phrase chunks at the syntactic level, named entities at the semantic level, and coreferential relations at the discourse level. We believe the corpus is unique in the Dutch language area because of the richness of its annotation layers, providing researchers with a useful gold-standard data set for different NLP tasks in the domains of morphology, (morpho)syntax, semantics and discourse.

Similar resources

Large Scale Syntactic Annotation of Written Dutch: Lassy

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the STEVIN programme. The focus is on written language in order to complement the Spoken Dutch Corpus (CGN) [13], completed in 2003. In D-COI (a pilot project funded by STEVIN), a 50-million-word pilot corpus has been compiled, parts of which were enriched with verified synta...

Adding Semantic Role Annotation to a Corpus of Written Dutch

We present an approach to automatic semantic role labeling (SRL) carried out in the context of the Dutch Language Corpus Initiative (D-Coi) project. Adapting earlier research which has mainly focused on English to the Dutch situation poses an interesting challenge especially because there is no semantically annotated Dutch corpus available that can be used as training data. Our automatic SRL ap...

Huge Parsed Corpora in LASSY

One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language p...

Construction of an aligned monolingual treebank for studying semantic similarity

Modern paraphrase research would benefit from large corpora with detailed annotations. However, currently these corpora are still thin on the ground. In this paper, we describe the development of such a corpus for Dutch, which takes the form of a parallel monolingual treebank consisting of over 2 million tokens and covering various text genres, including both parallel and comparable text. This ...

Enhanced Dialogue Markup for Crisis Talk Scenario Resources

We present a method of enhancing dialogue markup by mapping HPSG-based discourse category information into XML. The application scenario is crisis talk, specifically cockpit voice recording (CVR) transcripts of aviation disasters. This approach is new both as a source of richly annotated spoken language corpus resources for a little known scenario, and in grammatical theory and language documen...

Publication date: 2006